In this report I will explore the red wine quality dataset, which contains 1,599 red wines with 12 variables: 11 chemical properties of the wine and a quality score. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
chlorides: the amount of salt in the wine
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density: the density of wine is close to that of water depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol: the percent alcohol content of the wine
quality (score between 0 and 10)
In this section, we will first observe the structure of the dataset. Then, for each variable, we will plot a histogram to better understand its distribution and, where needed, a boxplot to better visualize its variability.
## [1] 1599 12
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The red wine quality dataset contains 1599 observations and 12 variables: 11 are numeric (based on physicochemical tests) and 1 is an ordered factor (based on sensory data).
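The structure check above can be sketched as follows. The file name `wineQualityReds.csv` is an assumption (the report does not say where the data is loaded from), so the sketch rebuilds three columns from the first ten values printed by `str()` to stay runnable on its own.

```r
# Assumption: in the real analysis the data would be loaded from a CSV, e.g.
# wine_df <- read.csv("wineQualityReds.csv")  # hypothetical file name

# Stand-in built from the first ten values printed by str() above
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine_head)  # number of rows and columns
str(wine_head)  # type and first values of each column
```

On the full data frame the same two calls would give the dimensions and column types shown above.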
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
The score is supposed to be between 0 and 10, but we see that it only falls between 3 and 8. The distribution looks roughly normal, with 5 as the most common value. More than 96% of the red wine samples have a quality of at least 5.
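The 96% figure can be checked directly from the counts in the table above. This is a small sketch using only those counts; the variable names are illustrative.

```r
# Counts per quality score, copied from the table above
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

scores <- as.integer(names(quality_counts))

# Most common score
names(quality_counts)[which.max(quality_counts)]  # "5"

# Share of wines with a quality of at least 5
share_at_least_5 <- sum(quality_counts[scores >= 5]) / sum(quality_counts)
round(100 * share_at_least_5, 2)  # 96.06
```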
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The values range from 4.60 to 15.90, with the middle 50% between 7.10 and 9.20.
Some values (> 14.50) seem to be outliers, so we might want to adjust the axes.
The distribution is slightly right skewed, so the median of 7.90 is a better measure of the center.
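The reasoning about skew and outliers can be illustrated with the summary statistics above. The sketch uses Tukey's 1.5 × IQR rule, which is an assumption on my part and not the cutoff used in the report (the 14.50 threshold above appears to be read off the plot).

```r
# Summary statistics copied from the output above
q1 <- 7.10; med <- 7.90; avg <- 8.32; q3 <- 9.20

# A mean above the median is consistent with a right-skewed distribution
avg > med  # TRUE

# Tukey's rule (assumed here): values above Q3 + 1.5 * IQR are flagged
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
upper_fence  # 12.35
```

This rule is stricter than the visual threshold, so it would flag more points; which cutoff is appropriate depends on the purpose of the plot.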
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The values range from 0.1200 to 1.5800, with the middle 50% between 0.3900 and 0.6400.
Some values (> 1) seem to be outliers, so we might want to adjust the axes.
The distribution seemed slightly right skewed before; now it looks rather normal, with a median (0.5200) approximately equal to the mean (0.5278). We can see some peaks at 0.42 and 0.56.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The values range from 0 to 1, with the middle 50% between 0.090 and 0.420.
One value (= 1) seems to be an outlier, so we might want to adjust the axes.
The distribution seems slightly right skewed, so the median of 0.260 is a better measure of the center. We can see multiple peaks at 0, 0.25 and 0.47.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The values range from 0.900 to 15.500, with the middle 50% between 1.900 and 2.600.
Some values (> 6.9) seem to be outliers, so we might want to adjust the axes.
The distribution looks normal around the peak but is right skewed, so the median of 2.200 is a better measure of the center.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The values range from 0.01200 to 0.61100, with the middle 50% between 0.07000 and 0.09000.
Some values (> 0.3) seem to be outliers, so we might want to adjust the axes.
The distribution looks normal around the peak but is right skewed, so the median of 0.07900 is a better measure of the center.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The values range from 1 to 72, with the middle 50% between 7 and 21.
Some values (> 58) seem to be outliers, so we might want to adjust the axes.
The distribution is right skewed, so the median of 14 is a better measure of the center.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The values range from 6.00 to 289.00, with the middle 50% between 22.00 and 62.00.
Some values (> 175) seem to be outliers, so we might want to adjust the axes.
The distribution is right skewed, so the median of 38.00 is a better measure of the center.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The values range from 0.9901 to 1.0037, with the middle 50% between 0.9956 and 0.9978.
The distribution looks approximately normal, with a mean of 0.9967.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The values range from 2.740 to 4.010, with the middle 50% between 3.210 and 3.400.
The distribution looks approximately normal, with a mean of 3.311.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The values range from 0.3300 to 2.0000, with the middle 50% between 0.5500 and 0.7300.
Some values (> 1.5) seem to be outliers, so we might want to adjust the axes.
The distribution is slightly right skewed, so the median of 0.6200 is a better measure of the center.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The values range from 8.40 to 14.90, with the middle 50% between 9.50 and 11.10.
Some values (> 14) seem to be outliers, so we might want to adjust the axes.
The distribution is slightly right skewed, so the median of 10.20 is a better measure of the center.
What is the structure of your dataset?
The red wine quality dataset contains 1599 observations and 12 variables: 11 are numeric (based on physicochemical tests) and 1 is an ordered factor (based on sensory data).
What I found :
What is/are the main feature(s) of interest in your dataset?
The main feature of interest in our dataset is the quality. A natural question to ask is which variables contribute to a high quality wine.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
So far I can’t really rule out any variable, so I would say that all 11 other variables can, at this stage, support my investigation into my feature of interest.
Did you create any new variables from existing variables in the dataset?
I didn’t create a new variable.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Some distributions caught my attention. First, the quality is supposed to be between 0 and 10, but the ratings in the dataset only range from 3 to 8, suggesting that extreme ratings are very rare or even impossible. Then, in the citric acid distribution, there are 132 observations with a citric acid value of 0. Even though citric acid is said to be found in small quantities, that is still 8% of the dataset without any citric acid.
The dataset was already tidy and there were no missing values, so I did not have to perform any cleaning during the exploration. However, outliers are present in some distributions, so I have to take them into consideration in my further explorations.
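The 8% figure follows from the counts quoted above. The commented lines show how the same checks might look on the full data frame; they assume it is named `wine_df`, as in the code later in this report.

```r
# Counts reported in the text above
zero_citric <- 132   # wines with a citric acid value of 0
n_wines     <- 1599  # total number of wines

round(100 * zero_citric / n_wines, 1)  # 8.3

# Possible checks on the full data frame (assumed to be wine_df):
# anyNA(wine_df)                 # TRUE if any value is missing
# sum(wine_df$citric.acid == 0)  # number of wines without citric acid
```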
Let’s look at a correlation matrix to try to understand the relationship between variables.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
For a correlation coefficient r, we define :
We restrict our attention to relationships between variables that are at least moderate and, for relationships involving our main feature (quality), to those with a correlation coefficient of at least 0.2 in absolute value.
Following the previous statement, we notice the following relationships :
| relationship | correlation coefficient | Strength | Direction |
|---|---|---|---|
| citric acid - fixed acidity | 0.67170343 | moderate | positive |
| citric acid - volatile acidity | -0.552495685 | moderate | negative |
| total sulfur dioxide - free sulfur dioxide | 0.667666450 | moderate | positive |
| density - fixed acidity | 0.66804729 | moderate | positive |
| density - citric acid | 0.36494718 | moderate | positive |
| density - residual sugar | 0.355283371 | moderate | positive |
| pH - fixed acidity | -0.68297819 | moderate | negative |
| pH - citric acid | -0.54190414 | moderate | negative |
| sulphates - citric acid | 0.31277004 | moderate | positive |
| sulphates - chlorides | 0.371260481 | moderate | positive |
| alcohol - density | -0.49617977 | moderate | negative |
| quality - volatile acidity | -0.390557780 | moderate | negative |
| quality - alcohol | 0.47616632 | moderate | positive |
| quality - citric acid | 0.22637251 | weak | positive |
| quality - sulphates | 0.251397079 | weak | positive |
We will look at these relationships in more detail in the following sections.
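The strength and direction labels in the table can be reproduced with a small helper. The cutoffs below (weak under 0.3, moderate from 0.3 up to 0.7, strong from 0.7) are an assumption: they are consistent with every row of the table, but they are not stated in the surviving text. The function name is also illustrative.

```r
# Assumed cutoffs (not stated in the report): |r| < 0.3 weak,
# 0.3 <= |r| < 0.7 moderate, |r| >= 0.7 strong
classify_correlation <- function(r) {
  strength <- cut(abs(r),
                  breaks = c(0, 0.3, 0.7, 1),
                  labels = c("weak", "moderate", "strong"),
                  right = FALSE, include.lowest = TRUE)
  direction <- ifelse(r >= 0, "positive", "negative")
  data.frame(r = r, strength = strength, direction = direction)
}

# Coefficients with quality, copied from the matrix above
# (the matrix itself could come from cor() on the numeric columns)
r_quality <- c(alcohol = 0.47616632, volatile.acidity = -0.390557780,
               sulphates = 0.251397079, citric.acid = 0.22637251)
classify_correlation(r_quality)
```

With these cutoffs, alcohol and volatile acidity come out as moderate and sulphates and citric acid as weak, matching the table.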
It appears that the more citric.acid there is, the higher the fixed.acidity. However, the variation grows as citric.acid increases.
It appears that the more citric.acid there is, the lower the volatile.acidity. However, there is still a lot of variation.
It appears that the more free.sulfur.dioxide there is, the more total.sulfur.dioxide there is. There is a peak at around 37 for free.sulfur.dioxide.
It appears that the higher the density, the higher the fixed.acidity. However, there is a lot of variation.
It appears that the higher the density, the more citric.acid there is. However, there is a lot of variation.
The relationship looks quite weak: even though there are some peaks, we can’t be sure there is a real relationship between these two variables.
It appears that the higher the pH, the lower the fixed.acidity. This is logical, as higher pH values correspond to a more basic (less acidic) liquid.
It appears that the higher the pH, the less citric acid there is. This is logical, as higher pH values correspond to a more basic (less acidic) liquid.
It appears that the more citric.acid there is, the more sulphates there are, especially above 0.75 of citric.acid, where the amount of sulphates increases a lot.
The relationship between sulphates and chlorides is not very clear. There are some peaks, but they seem quite random.
It appears that the more alcohol there is, the lower the density.
We can observe a trend here: lower volatile.acidity seems to mean higher quality.
Apart from the value for quality 5, we can observe a trend here: higher alcohol seems to mean higher quality.
We can observe a trend here: higher citric.acid seems to mean higher quality.
We can also observe a trend here: higher sulphates seem to mean higher quality.
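Trends like these can be checked numerically by comparing a summary statistic across quality scores. The sketch below only illustrates the mechanics: it uses the first ten rows from the `str()` output earlier in the report, which are far too few to show the trend itself. The commented call assumes the full data frame is named `wine_df`.

```r
# First ten rows from the str() output earlier in the report; these only
# illustrate the mechanics, not the actual trend
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol per quality score (the centre line of each box in a boxplot)
median_by_quality <- tapply(alcohol, quality, median)
median_by_quality

# On the full data the corresponding boxplot could be drawn with:
# boxplot(alcohol ~ quality, data = wine_df)
```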
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
I started my analysis by creating a correlation matrix in order to better understand the relationships between variables.
I narrowed down the relationships I was interested in by keeping only those between variables that are at least moderate and, for relationships involving our main feature (quality), those with a correlation coefficient of at least 0.2 in absolute value. I did that so I could focus only on the most predominant relationships.
Of the 15 relationships kept for exploration, 4 concerned the quality variable and I observed these trends :
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
Of the 15 relationships kept for exploration, 11 concerned the other features and I observed these trends :
What was the strongest relationship you found?
Concerning the feature of interest, the strongest relationship that I found was between quality and alcohol, with a correlation coefficient of 0.47616632, meaning it is a moderate positive relationship.
Concerning the other features, the strongest relationship that I found was between pH and fixed acidity, with a correlation coefficient of -0.68297819, meaning it is a moderate (almost strong) negative relationship. This is logical, as higher pH values correspond to a more basic (less acidic) liquid.
First I will try to visualize relationships between the feature of interest and 2 other features, then I will try to visualize relationships between 3 other features.
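# create a quality_rating variable that classifies the quality into 3 categories:
# 'Bad' (quality < 5), 'Average' (quality 5 or 6) and 'Good' (quality >= 7)
wine_df$quality_rating <- ifelse(wine_df$quality < 5, 'Bad',
                                 ifelse(wine_df$quality < 7,
                                        'Average', 'Good'))
# store the result as an ordered factor so that Bad < Average < Good
wine_df$quality_rating <- ordered(wine_df$quality_rating,
                                  levels = c('Bad', 'Average', 'Good'))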
In this new section, I created a new variable quality_rating containing the rating of the wine (Bad, Average or Good) according to the quality so I could facet wrap any future visualization with that variable.
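As an illustration of what this variable looks like, the same rule can be written with `cut()` and applied to the quality counts from the frequency table earlier in the report. This is an alternative sketch, not the code used in the analysis; `rate_quality` is a hypothetical helper name.

```r
# Same rule as above, written with cut():
# quality < 5 -> Bad, 5 or 6 -> Average, 7 and above -> Good
rate_quality <- function(quality) {
  cut(quality,
      breaks = c(-Inf, 5, 7, Inf),
      labels = c("Bad", "Average", "Good"),
      right = FALSE, ordered_result = TRUE)
}

# Rebuild the quality column from the counts in the frequency table
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))
table(rate_quality(quality))
# Bad: 63, Average: 1319, Good: 217
```

The "Average" group dominates, so any facet for "Bad" or "Good" wines is based on comparatively few observations.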
High quality wines seem to have high alcohol and low volatile acidity.
High quality wines seem to have low volatile acidity and high citric acid.
High quality wines seem to have high sulphates and low volatile acidity.
We can see that high alcohol tends to go with high quality wines, but we can’t really say anything about the citric acid here.
High quality wines seem to have high alcohol and high sulphates.
We can see that high sulphates tend to go with high quality wines, but we can’t really say anything about the citric acid here.
High quality wines seem to have low chlorides and low volatile acidity.
High quality wines seem to have low volatile acidity and low density (even though the relationship seems to be weak for the density).
There doesn’t seem to be any meaningful relationship between alcohol and residual sugar.
High fixed acidity and low volatile acidity tend to go with high citric acid.
High citric acid and high fixed acidity tend to go with the highest density.
Low citric acid and low fixed acidity tend to go with high pH.
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
The following combinations seem to contribute to a high quality wine:
These were the relationships that were the easiest to find.
Were there any interesting or surprising interactions between features?
The thing that surprised me the most is that there is no meaningful relationship between residual sugar and alcohol.
The plot above is a bar plot showing the distribution of the quality (from 0 to 10) in the red wines dataset.
The quality is supposed to be between 0 and 10, but we see that it only falls between 3 and 8. Maybe really good or really bad wines do not exist, or maybe the dataset simply does not include them. Additionally, more than 96% of the red wine samples have a quality of at least 5, meaning there are not many bad wines in the dataset.
This plot shows a box plot of the alcohol percentage for each quality. This is a good way to represent the relationship between alcohol and quality, as it shows how the median and the variability of the alcohol percentage change with quality.
Apart from the boxplot for quality 5, a trend seems to emerge: a higher alcohol percentage tends to go with a higher quality.
This scatterplot shows the relationship between the alcohol percentage and the volatile acidity, while showing at the same time the quality of each observation.
We notice two clusters of points :
We can then state that high quality red wines tend to have a high alcohol percentage (as seen previously) but also low volatile acidity.
This observation is not surprising, as too high a level of volatile acidity can lead to an unpleasant, vinegar taste and therefore to a lower quality.
This project was interesting because it allowed me to put into practice the different steps of Exploratory Data Analysis with a powerful language like R.
The dataset I worked on is the red wine quality dataset. This dataset contains 1599 observations and 12 variables: 11 are numeric (based on physicochemical tests) and 1 is an ordered factor (based on sensory data).
First of all, I did a univariate exploration. I began by observing the structure of the dataset, displaying its dimensions and the types of its variables. Then, for each variable in the dataset, I displayed a summary and a histogram to get an overview of its distribution. This allowed me to see how it was distributed (right skewed, left skewed or normal) and whether there were any outliers. It also strengthened my understanding of the dataset. After this exploration, I chose to focus mainly on the quality feature, and I asked myself which variables contribute to a high quality wine.
After that, I did a bivariate exploration. I started my analysis by creating a correlation matrix in order to better understand the relationships between the variables. I narrowed down the relationships I was interested in by keeping only those between variables that are at least moderate and, for relationships involving the main feature (quality), those with a correlation coefficient of at least 0.2 in absolute value. I did that so I could focus only on the most predominant relationships. Concerning the main feature, I ended up with 4 relationships: less volatile acidity means higher quality, more alcohol means higher quality, more citric acid means higher quality and more sulphates mean higher quality.
In the final part of the EDA, I did a multivariate exploration. Since there were many variables to consider and many possible variable associations, I first focused on the relationships involving the variable of interest (quality) and 2 other variables. I could see, for instance, that high quality wines seem to have high alcohol, low volatile acidity (which is logical, since it is responsible for a vinegar taste in large quantities) and high sulphates. Then, I focused on a few sets of 3 other variables that showed correlations in the bivariate exploration. I could see, for instance, that low citric acid and low fixed acidity tend to go with high pH, which is logical as higher pH values correspond to more basic (less acidic) solutions.
During this project, I encountered difficulties mainly when interpreting the plots of the multivariate exploration. Indeed, when a third variable is added, the plot sometimes immediately becomes less clear and the relationships much harder to identify. To counter this problem, I created a new variable quality_rating, which classifies the quality variable into 3 categories: “Bad”, “Average” and “Good”. I then plotted the variables again, this time using facet_wrap with the new quality_rating variable. This allowed me to focus only on wines with a good quality_rating, and it helped me to discover some trends.
Among the successes of this project, I was especially surprised at how much the correlation matrix helped me in this exploration. It allowed me to guide my analysis and discover patterns and trends. This is definitely something I will rely on in my future analyses.
In the future, the analysis could be enriched by combining the red wine quality dataset with the white wine quality dataset. It might be interesting to determine the commonalities and differences between these two datasets, and this could also allow us to discover new insights about the 2 datasets.